[v5] Return a BatchEncoding dict from apply_chat_template by default #41626
Rocketknight1 merged 11 commits into main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
It's a v5 breaking change, so cc @LysandreJik @Cyrilvallez @ArthurZucker for review to make sure you're okay with it.
Force-pushed from 0f4f200 to 9330959
assert self.rust_tokenizer_3b([" Sam", "Sam"]).input_ids == [[5502, 2], [5502, 2]]

@require_jinja
def test_tokenization_for_chat(self):
Out of curiosity, why do you remove this test?
It's extremely old - it's not really related to this PR. These tests predate chat templates; we just patched them to support chat templates after those were added. They only exist for a few models, and I don't think we want to keep them, because it's not clear what they test that the main chat template tests don't.
]

- output = self.tokenizer.apply_chat_template(conversation, tokenize=True)
+ output = self.tokenizer.apply_chat_template(conversation, tokenize=True).input_ids
Nit for consistency: since tokenize=True is the default, I guess you can remove it.
]
with self.assertRaises(ValueError):
-     tokenizer.encode_message_with_chat_template(conversation[0], add_generation_prompt=True)
+     tokenizer.encode_message_with_chat_template(conversation[0], add_generation_prompt=True, return_dict=False)
Nit again: Not sure if this change is needed
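To make the method under discussion concrete, here is a hedged sketch of incremental encoding with `encode_message_with_chat_template`, assuming `tokenizer` and `conversation` are the objects from the quoted test; `return_dict=False` is the kwarg shown in the diff above, keeping the plain list-of-ids return:

```python
# Sketch: encode a conversation one message at a time.
# Assumes `conversation` is a list of {"role": ..., "content": ...} dicts
# and `tokenizer` has a chat template, as in the quoted test.
token_ids = []
for message in conversation:
    # return_dict=False keeps the plain list-of-ids return
    # (the v5 default is now a BatchEncoding dict)
    token_ids.extend(
        tokenizer.encode_message_with_chat_template(message, return_dict=False)
    )

# Per the quoted test, this call is expected to raise ValueError:
# tokenizer.encode_message_with_chat_template(conversation[0], add_generation_prompt=True)
```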
Force-pushed from b729fb7 to c452142
Force-pushed from 2acbd45 to abf7158
Flip the default return type for `apply_chat_template` to match the underlying tokenizer
Force-pushed from abf7158 to 137015e
[For maintainers] Suggested jobs to run (before merge): run-slow: blenderbot, bloom, cohere, gemma, gpt2, gpt_sw3, llama, voxtral
This change is a bit unfortunate. It seems like an unnecessary API break that is going to cause a headache for a lot of people. You can see all the places we use this in mlx-lm, and all the models uploaded to the Hugging Face Hub have code snippets which include this (see e.g. https://huggingface.co/mlx-community/mistralai_Devstral-Small-2-24B-Instruct-2512-MLX-6Bit).
It is, and I'm sorry - but the …
Yes, as @Rocketknight1 said, we believe it helps far more users than it breaks current code... We unfortunately cannot go forward without breaking a few eggs... But we always provide a way to keep the exact same behavior as before!
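For anyone migrating: based on the diffs in this PR, here is a sketch of two ways to keep the pre-v5 behavior (`tokenizer` and `messages` stand in for any chat-enabled tokenizer and conversation; treat this as illustrative rather than official migration docs):

```python
# Option 1: opt out of the new dict return entirely
ids = tokenizer.apply_chat_template(messages, return_dict=False)

# Option 2: keep the new default BatchEncoding and pull out the token ids,
# as the updated tests in this PR do
ids = tokenizer.apply_chat_template(messages).input_ids
```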
Return a BatchEncoding dict from apply_chat_template by default (huggingface#41626)

* Flip the default return type for `apply_chat_template` to match the underlying tokenizer
* Remove test_tokenization_for_chat tests, which no longer do anything useful
* Remove test_tokenization_for_chat tests, which no longer do anything useful
* Fix test_encode_message tests
* Fix test_encode_message tests
* Return dicts for Processor too
* Fix mistral-common tests
* Catch one of the processors too
* revert test bug!
* nit fix
* nit fix
Tokenizers return a BatchEncoding dict by default, but `apply_chat_template` doesn't. This is just an accident of how I wrote it originally, which we were stuck with for backward compatibility reasons. Ideally, I think `apply_chat_template` should return exactly the same format as tokenizers, since it also performs tokenization most of the time. It's now `v5` time, so we can start making that happen 😅

This PR also updates tests, and removes very old `test_tokenization_for_chat` tests. These model-specific tests don't do anything useful anymore, since the `apply_chat_template` functionality is unified across tokenizers; they're mostly a legacy leftover from when model classes used to need custom chat tokenization functions.
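To illustrate the new default, a minimal sketch (the checkpoint name is just an example; any tokenizer with a chat template behaves the same way):

```python
from transformers import AutoTokenizer

# Example checkpoint; any chat-template-enabled tokenizer works the same way
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
messages = [{"role": "user", "content": "Hello!"}]

# v5: apply_chat_template now returns a BatchEncoding dict, just like
# calling the tokenizer directly (typically input_ids and attention_mask)
encoded = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(list(encoded.keys()))
print(encoded.input_ids)  # attribute access works, as on any BatchEncoding
```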